What is text clustering?
What are the applications?
How to cluster text data?
Clustering: the process of grouping a set of objects into clusters of similar objects
Discover “natural structure” of data
Basic criteria
high intra-cluster similarity
low inter-cluster similarity
No (little) supervision signal about the underlying clustering structure
Need similarity/distance as guidance to form clusters
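For example, cosine similarity between term-weight vectors is a standard choice of guidance; a minimal sketch with made-up term-frequency vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy term-frequency vectors (illustrative only)
d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])  # same direction as d1
d3 = np.array([0.0, 0.0, 3.0])  # shares no terms with d1

sim_12 = cosine_similarity(d1, d2)  # close to 1.0
sim_13 = cosine_similarity(d1, d3)  # 0.0
```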
Organize document collections
Partitional clustering
Hierarchical clustering
Topic modeling
Hard clustering: Each document belongs to exactly one cluster
Soft clustering: A document can belong to more than one cluster.
Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion
Typical partitional clustering algorithms
k-means clustering
Assumes documents are real-valued vectors.
Clusters based on centroids of points in a cluster, \(c\):
\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec x \in c}{\vec x}\]
Until clustering converges (or other stopping criterion):
For each doc \(d_i\): assign \(d_i\) to the cluster with the nearest centroid
(Next, update the seeds to the centroid of each cluster)
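The assign-then-update loop can be sketched in plain numpy; the 2-D points stand in for document vectors, and the random initialization and fixed iteration count are illustrative choices:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # initialize seeds as k randomly chosen documents
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated toy groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, k=2)
```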
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
Typical hierarchical clustering algorithms
Bottom-up agglomerative clustering
Top-down divisive clustering
Start with all data in one cluster
Repeatedly split one of the remaining clusters into two
Starts with each doc in a separate cluster
The history of merging forms a binary tree or hierarchy.
Many variants for defining the closest pair of clusters (linkage methods), e.g., single, complete, and average linkage
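As a sketch of one such variant, single linkage (cluster distance = distance between the closest pair of members) can be written directly in numpy; the 2-D points are made up for illustration:

```python
import numpy as np

def single_linkage(X, num_clusters):
    """Bottom-up agglomerative clustering with single linkage:
    repeatedly merge the two clusters whose closest members are closest."""
    clusters = [[i] for i in range(len(X))]  # start: each point alone
    while len(clusters) > num_clusters:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance of the closest cross-cluster pair
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
merged = single_linkage(X, 2)
```

Recording the order of merges (rather than discarding it) is what yields the dendrogram described above.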
Three concepts: words, topics, and documents
Each document is a collection of words and has a probability distribution over topics
Topics have a probability distribution over words
Model:
Treat data as observations that arise from a generative probabilistic process that includes hidden variables: For documents, the hidden variables reflect the thematic structure of the collection.
Infer the hidden structure using posterior inference: What are the topics that describe this collection?
Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?
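The generative process can be illustrated by sampling: for each word position, draw a hidden topic from the document's topic distribution, then draw a word from that topic's word distribution. The two topics below are hand-made toy distributions, not learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy vocabulary and two hand-made topic-word distributions (illustrative only)
vocab = ["goal", "match", "team", "stock", "market", "price"]
topics = np.array([
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],  # a "sports"-like topic
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],  # a "finance"-like topic
])

def generate_document(doc_topic_dist, length=10):
    """One generative pass: per word, draw a hidden topic, then a word."""
    words = []
    for _ in range(length):
        z = rng.choice(len(topics), p=doc_topic_dist)  # hidden topic assignment
        w = rng.choice(len(vocab), p=topics[z])        # observed word
        words.append(vocab[w])
    return words

# a document that is mostly about the first topic
doc = generate_document(np.array([0.8, 0.2]))
```

Posterior inference runs this process in reverse: given only the observed words, estimate the hidden topic distributions.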
Scalability
Ability to deal with various types of data
Few or no assumptions about the input data
Minimal requirement about domain knowledge
Interpretability and usability
Internal criterion: A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
Criteria to determine whether the clusters are meaningful
Internal validation
External validation
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Davies–Bouldin index
\(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) ← Evaluate every pair of clusters
We prefer smaller DB-index!
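The formula above translates directly to code; here \(\sigma_i\) is taken as the mean distance of cluster \(i\)'s points to its centroid, and the tight, well-separated toy clusters give the small DB value we prefer:

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average over clusters of the worst-case
    (sigma_i + sigma_j) / d(c_i, c_j) over all other clusters j."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == i].mean(axis=0) for i in ids])
    # sigma_i: mean distance of cluster i's points to its centroid
    sigmas = np.array([np.linalg.norm(X[labels == i] - c, axis=1).mean()
                       for i, c in zip(ids, centroids)])
    k = len(ids)
    db = 0.0
    for i in range(k):
        ratios = [(sigmas[i] + sigmas[j])
                  / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        db += max(ratios)  # worst pairing for cluster i
    return db / k

# two tight, well-separated toy clusters -> small (good) DB index
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
score = davies_bouldin(X, np.array([0, 0, 1, 1]))
```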
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth … requires labeled data
Assume documents with \(C\) gold standard classes, while our clustering algorithms produce \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\) with \(n_i\) members.
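One common external measure under this setup is purity: each cluster \(\omega_i\) is credited with its majority gold class, and the majority counts are summed and normalized by the number of documents. A sketch with made-up labels:

```python
import numpy as np

def purity(clusters, gold):
    """Purity: each cluster votes for its majority gold class;
    return the fraction of documents matching their cluster's majority."""
    clusters = np.asarray(clusters)
    gold = np.asarray(gold)
    total = 0
    for c in np.unique(clusters):
        members = gold[clusters == c]
        # size of the most common gold class inside this cluster
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(gold)

# toy example: K = 2 clusters, C = 2 gold classes
score = purity([0, 0, 0, 1, 1, 1], ["A", "A", "B", "B", "B", "A"])
```

Purity is easy to read but rewards many tiny clusters, so it is usually reported alongside other external measures.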
Text clustering
In clustering, clusters are inferred from the data without human input (unsupervised learning)
Many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents
Evaluation